May 26, 2020

Course Introductions

What will we be doing in this course?

  • Learning how to program / code in R!
  • Learning some basic statistics and tools for data analysis
  • Gaining practical skills for career / research practice

Why learn Data Science & R?

  • In-demand skill
  • Data formats are changing, and traditional tools like Microsoft Excel are insufficient for certain tasks
  • R makes it easy to carry out tasks like web scraping, machine learning, statistical analysis and web / dashboard building

Course Outline

  • Class 1:
    • Setting Up
    • RMarkdown
    • Reading and Writing Data
  • Class 2:
    • Introduction to the Tidyverse
    • Simple data manipulation with dplyr
    • Data visualization with ggplot2
  • Class 3:
    • Basic Data Analysis Process
    • Exploratory Data Analysis
    • Dealing with Missing Data
    • A simple hacking project

Course Outline

  • Class 4:
    • Function Writing with purrr
    • Some Statistical Analysis with psych and base R
  • Class 5:
    • Statistical Analysis
  • Class 6:
    • Basic Machine Learning Concepts with caret

Course Outline

  • Class 7:
    • Working with strings using stringr and rebus packages
    • Simple NLP using tidytext, tm and wordcloud
  • Class 8:
    • Web scraping with rvest
    • Working with APIs using httr
  • Class 9:
    • Dashboard & Website Building with shiny

Let’s Get Started

  • Download R
  • Download RStudio
  • Open RStudio and explore the Console
  • Helpful Reading: Hadley Wickham’s book

What is R?

  • R is a programming language that is commonly used for data science
  • It is open-source, with a strong community of users and 14,837 packages available!
  • It is free to use, unlike MATLAB, SPSS or Stata
  • R is an easy first programming language to learn
  • Less popular than Python, but not second fiddle in my opinion (especially in the field of data science)
  • RStudio is arguably a better IDE than Jupyter (IMO it is the Apple to the Microsoft)

What is RStudio?

  • RStudio is the IDE that allows you to code in R (and other languages like SQL and Python)
    • IDE stands for Integrated Development Environment
    • It allows you to write functions and run operations
    • It lets you tidy and visualize data

RStudio

  • RStudio's panes include the Console, Environment, Files and Packages

What is R Markdown?

  • R Markdown is the tool for you to report and present your code / output
    • File -> New File -> R Markdown -> Document -> HTML
  • Set output to different file formats (with the Knit button)

Markdown Syntax

  • You will need to learn Markdown syntax
  • Use a cheat sheet as a reference (search for "markdown cheat sheet")
  • You can use Markdown to embed formatting instructions into your text. For example, you can make a word italicized by surrounding it in asterisks, bold by surrounding it in two asterisks, and monospaced (like code) by surrounding it in backticks:

*italics*, **bold**, `code`

  • You can turn a word into a link by surrounding it in square brackets and then placing the full URL (including https://) behind it in parentheses, like this:

[Columbia U](https://www.columbia.edu)

R Markdown Cheat Sheet

Headers

To create titles and headers, use leading hashtags. The number of hashtags determines the header’s level:

# First level header
## Second level header
### Third level header

Lists

To make a bulleted list in Markdown, place each item on a new line after an asterisk and a space, like this:

* item 1
* item 2
* item 3

You can make an ordered list by placing each item on a new line after a number followed by a period followed by a space.

1. item 1
2. item 2
3. item 3

Embedding equations

You can also use Markdown syntax to embed LaTeX math equations into your reports. To embed an equation in its own centered equation block, surround the equation with two pairs of dollar signs, like this:

$$1 + 1 = 2$$

To embed an equation inline, surround it with a single pair of dollar signs, like this: $1 + 1 = 2$

All standard LaTeX symbols work.
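
As a slightly larger illustration, here is how a display equation with a fraction, a sum and subscripts might look. The formula (the sample mean) is a standard one chosen for this sketch and does not come from the slides:

```latex
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
```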

Including R code inline and in chunks

  • R code can be included as a chunk with

    ```{r} ```

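    or inline, using single backticks with a leading r, e.g. `r nrow(mtcars)`.

  • R functions sometimes return messages, warnings, and even error messages. By default, R Markdown includes messages and warnings in your report; set the chunk options message = FALSE and warning = FALSE to hide them. The error option controls what happens when a chunk fails: error = TRUE shows the error in the report and keeps knitting, while error = FALSE stops knitting at the error.

  • The keyboard shortcut to create a new chunk is Command + Option + I on macOS (Ctrl + Alt + I on Windows / Linux)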

Popular chunk options
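
The slide does not list the options themselves, so the selection below is an assumption: a minimal sketch of a few commonly used knitr chunk options, set as document-wide defaults with `knitr::opts_chunk$set()`. The same names can also be set per chunk inside the chunk header. It assumes the knitr package is installed.

```r
library(knitr)

# Set default options for every chunk that follows in the document
opts_chunk$set(
  echo       = TRUE,   # show the R code in the output
  eval       = TRUE,   # run the code
  message    = FALSE,  # hide messages (e.g. package start-up notes)
  warning    = FALSE,  # hide warnings
  fig.width  = 6,      # figure width in inches
  fig.height = 4       # figure height in inches
)

# Inspect one of the defaults we just set
opts_chunk$get("message")
```

This call usually goes in a setup chunk at the top of the document, so that later chunks only need to override what differs.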

Knitr

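knitr is an engine for dynamic report generation with R. It runs the R code in an R Markdown file and weaves the results into a Markdown document (“knitting”), which the rmarkdown package then passes to Pandoc to produce the desired output format.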
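
To see what knitting does without clicking the Knit button, the sketch below writes a tiny R Markdown file to a temporary location and knits it to plain Markdown with `knitr::knit()`. The file contents and variable names are made up for illustration, and the knitr package is assumed to be installed.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to keep this example readable

# Write a tiny R Markdown document to a temporary file (contents are illustrative)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# Demo report",
  "",
  "The mtcars data has `r nrow(mtcars)` rows.",   # inline R code
  "",
  paste0(fence, "{r, echo = FALSE}"),             # start of a code chunk
  "summary(mtcars$mpg)",
  fence                                           # end of the code chunk
), rmd_path)

# Knit: run the R code and write the results into a Markdown file
md_path  <- knit(rmd_path, output = tempfile(fileext = ".md"), quiet = TRUE)
md_lines <- readLines(md_path)
print(md_lines)
```

In the printed Markdown, the inline expression has been replaced by its value and the chunk by its output; the Knit button in RStudio goes one step further and converts that Markdown to HTML, PDF or Word.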

Other Output Formats

  • html_document
  • pdf_document
  • word_document
  • beamer_presentation / slidy_presentation / ioslides_presentation
  • github_document

Packages and Dependencies

  • Installing packages
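# install.packages("dplyr") # Run once to install the package
library(dplyr) # Load the package in every new session
  • Or use a package manager, e.g. pacman
# install.packages("pacman") # Run once to install pacman
library(pacman)
p_load(dplyr, ggplot2) # Installs any missing packages, then loads them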
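
If you would rather stay in base R, one possible pattern is to check which packages are missing before installing anything. `missing_packages()` below is a hypothetical helper written for this sketch, not a function from any package:

```r
# Hypothetical helper: return the requested packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ggplot2"))

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
} else {
  message("All requested packages are already installed")
}
```

The install line is left commented out so that running the sketch never changes your library by accident.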

Core Packages in R

  • ggplot2 (graphics)
  • tibble (data frames and tables)
  • tidyr (make tidy)
  • readr (read in tabular formats)
  • purrr (functional programming)
  • dplyr (manipulate data)
  • tidyverse (loads all of the above, plus a few others)
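
As a small taste of how these packages are used, here is a sketch with dplyr on mtcars, one of the datasets that ships with R. The column names `mean_mpg` and `n_cars` are chosen for this example, and dplyr is assumed to be installed.

```r
library(dplyr)

# Average miles per gallon for each cylinder count in the built-in mtcars data
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

print(mpg_by_cyl)
```

The pipe `%>%` passes the result of each step to the next one, so the code reads top to bottom: take the data, group it, then summarise each group.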

Importing / Reading in Data

  • R comes with several free datasets that are easy to practice and learn with
  • For instance, mtcars
  • To list all the built-in datasets, just type data() in the console
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Importing / Reading in Data

  • Reading data from the web
# Using the data.table package to read files
p_load(data.table)
flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")

Importing / Reading in Data

  • Reading data from your computer
  • Check working directory / Set working directory
# Check working directory
getwd()
## [1] "/Users/geraldlee/Documents/Intro to R"
# Set working directory
setwd('/Users/geraldlee/Documents/Intro to R')
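
Relative paths such as "data/data_example1.xlsx" are resolved against the working directory, so it can help to check that a file is where you expect before reading it. A minimal sketch, in which `tempdir()` merely stands in for a project folder:

```r
original_wd <- getwd()           # remember where we started

setwd(tempdir())                 # tempdir() stands in for a project folder here
data_path <- file.path("data", "data_example1.xlsx")  # relative to the working directory
found <- file.exists(data_path)  # TRUE only if the file is really there
print(found)

setwd(original_wd)               # go back to the original working directory
```

`file.path()` builds the path with the right separator for your operating system, which avoids hand-typed slashes.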

Importing / Reading in Data

  • Reading data from your computer
# Using the readxl package to read in Excel files
library(readxl)
rawData <- read_excel(path = "data/data_example1.xlsx", # Path to file
                      sheet = 2, # We want the second sheet
                      skip = 1, # Skip the first row
                      na = "NA") # Missing values are coded as "NA"
# Or fread, for delimited text files such as CSV (fread cannot read .xlsx files)
rawData <- fread("data/data_example1.csv") # Hypothetical CSV export of the same data
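
Writing data works the same way in reverse. The sketch below uses only base R and a temporary folder, so it runs without any example files; the file name `mtcars_example.csv` is made up for illustration.

```r
# Write the built-in mtcars data to a CSV file in a temporary folder
csv_path <- file.path(tempdir(), "mtcars_example.csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Read it back in, using the first column as the row names
cars_from_disk <- read.csv(csv_path, row.names = 1)

dim(cars_from_disk)   # same shape as the original data
head(cars_from_disk)
```

In a real project you would replace `tempdir()` with a folder inside your working directory, such as "data".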

Take a Look at the Data

head(flights) # head() / tail() show the first / last 6 rows by default
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
## 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
## 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
## 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035
## 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475
## 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454
##    hour
## 1:    9
## 2:   11
## 3:   19
## 4:    7
## 5:   13
## 6:   18

Another Way to Look at Data

dim(flights) # Get the shape of the data
## [1] 253316     11
colnames(flights) # Get the column names
##  [1] "year"      "month"     "day"       "dep_delay" "arr_delay" "carrier"  
##  [7] "origin"    "dest"      "air_time"  "distance"  "hour"
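
A few more base R functions are commonly used for a first look at a dataset. The sketch below uses the built-in mtcars data, so it runs without downloading anything; the variable name `col_types` is chosen for this example.

```r
str(mtcars)           # column types and a preview of the values
summary(mtcars$mpg)   # min, quartiles, median, mean and max of one column
nrow(mtcars)          # number of rows
ncol(mtcars)          # number of columns

# The class of every column, as a named character vector
col_types <- sapply(mtcars, class)
print(col_types)
```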

Seeking help

  • Look at the Help tab in RStudio, or ask for help from the console
?ggplot2
help(dplyr)
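
When you do not know the exact name of a function, a few base R tools can help you search. This is a minimal sketch; the search pattern and the topics in the comments are just examples.

```r
# Find functions whose names start with "read." among the attached packages
read_functions <- apropos("^read\\.")
print(read_functions)

# Show the arguments of a function without opening its help page
args(head)

# In an interactive session you could also try:
# ?head                      # help page for one function
# help(package = "dplyr")    # index of everything in a package
# ??"working directory"      # search all installed help pages
# example(mean)              # run the examples from a help page
```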

Data Science Flow Chart

  • We have just explored the 1st part (Importing data)
  • This course will focus on Tidying, Transforming and some Visualisation
  • Not a full-fledged Statistics or Machine Learning Course

Possible Projects